Unit 2 Module 1

Spatial Data Analysis with R (01:450:320)

1 Working with spatial vector data

In this section we will start to work with spatial vector data, learning the most common forms of vector-based spatial operations. We will be working primarily with the sf package, which replaces the earlier sp package.

The material in this section assumes that the reader is familiar with standard GIS operations and concepts, ranging from projections and transformations to intersections, buffers, and differences.

2 Preparation

2.1 Set-up

Let’s start by loading sdadata.

# A reminder of how to install
# devtools::install_github("SSELP/sdadata")
library(sdadata)

sdadata imports sf. sf depends on the following external libraries (from here):

SystemRequirements: GDAL (>= 2.0.1), GEOS (>= 3.4.0), PROJ (>= 4.8.0), sqlite3

These are external libraries that provide key geospatial capabilities that are used by R and many other languages and platforms that can be used for geospatial analytics. If you are using the docker container, these are already installed so you don’t have to worry about them. Read more about them by following these links: GDAL, GEOS, PROJ, sqlite3

You should already have your class Rstudio project (e.g. ls320) setup, with a notebooks/ folder within it. If you don’t already have it, create a data/ folder within notebooks/, which is where you will practice writing spatial data files.

3 Reading, writing, and making vector data

This reading introduces reading, writing, and making vector data, how to work with their coordinate reference systems (CRS), and how to do some basic plotting, leavened with some additional R programming examples. However, the first thing we need to do is learn about simple features, which is the representation of vector data that is employed by sf (which stands for simple features).

3.1 Simple features, explained

To learn more about what simple features are and how they work, and why R developed a package for them, please read the sf introductory vignette written by Edzer Pebesma, the primary sf package developer.

3.2 From non-spatial to spatial

We are going to start by reading in the species dataset, which installs with sdadata. It contains occurrence records for the “Big Five” (Elephant, Lion, Leopard, Buffalo, and Rhinoceros) in Tanzania, sourced from the Global Biodiversity Information Facility (GBIF). This dataset is a subset that contains the following variables:

  • family: The biological family to which the species belongs (e.g., Felidae or Elephantidae).
  • species: The scientific name of the animal (e.g., Panthera leo).
  • month: The month of the year the observation was recorded (1−12)
  • year: The year of the observation, allowing for analysis of trends over time.
  • x: Longitude in decimal degrees.
  • y: Latitude in decimal degrees.

Here’s a quick look:

# Chunk 1
species <- read_csv(system.file("extdata/species.csv", package = "sdadata"))
#
# 1
set.seed(30)
species %>% 
  sample_n(5) %>% 
  arrange(year, month)
#> # A tibble: 5 × 6
#>   family       species            month  year     x     y
#>   <chr>        <chr>              <dbl> <dbl> <dbl> <dbl>
#> 1 Felidae      Panthera leo          10  2018  34.9 -2.42
#> 2 Elephantidae Loxodonta africana     1  2023  36.1 -4.05
#> 3 Felidae      Panthera leo           1  2024  36.1 -3.78
#> 4 Elephantidae Loxodonta africana     8  2024  36.1 -3.67
#> 5 Elephantidae Loxodonta africana     3  2025  34.9 -2.60
#
# #2
species %>% 
  distinct(species) %>% 
  count()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1     5
# 
# #3
species %>% 
  group_by(species, year) %>% 
  count() %>% 
  arrange(year)
#> # A tibble: 133 × 3
#> # Groups:   species, year [133]
#>    species             year     n
#>    <chr>              <dbl> <int>
#>  1 Loxodonta africana  2000     4
#>  2 Panthera leo        2000     2
#>  3 Panthera pardus     2000     2
#>  4 Syncerus caffer     2000     4
#>  5 Loxodonta africana  2001     5
#>  6 Panthera leo        2001     9
#>  7 Panthera pardus     2001     2
#>  8 Syncerus caffer     2001     2
#>  9 Diceros bicornis    2002     1
#> 10 Loxodonta africana  2002     4
#> # ℹ 123 more rows

Okay, #1 shows us that this dataset, read in as species, is a tibble, which means it is non-spatial at the moment, even though its comes with coordinates. #2 shows us that there are 5 species (big five) in the dataset, and #3 shows a varying number of observations per species per year.

Given that these are point data, we can turn them into a facsimile of spatial data by simply plotting them and having a look.

# Chunk 2
ggplot(species %>% distinct(species, x, y)) + 
  geom_point(aes(x, y))

To avoid possible duplicate points, we first cut species down to just unique species records and coordinates (GBIF data is messy), and then use gglot to shows us, in a very crude way, where these points lie in space. This works, because species$x (the longitude) is exactly analogous to the x coordinate of a scatter plot, and species$y (latitude) is the y coordinate. But it is still not a spatial object. So, let’s convert it to one, and then plot it again.

# Chunk 3
# #1
species <- st_as_sf(species, coords = c("x", "y"))
species
#> Simple feature collection with 6951 features and 4 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 29.91544 ymin: -11.09595 xmax: 38.79323 ymax: -1.44546
#> CRS:           NA
#> # A tibble: 6,951 × 5
#>    family       species            month  year             geometry
#>    <chr>        <chr>              <dbl> <dbl>              <POINT>
#>  1 Elephantidae Loxodonta africana     6  2012 (34.08678 -2.276964)
#>  2 Bovidae      Syncerus caffer        6  2012   (35.75882 -3.5614)
#>  3 Bovidae      Syncerus caffer        6  2012 (35.44965 -3.230564)
#>  4 Felidae      Panthera pardus        6  2012 (34.96447 -2.475424)
#>  5 Elephantidae Loxodonta africana     6  2012 (35.13262 -1.883995)
#>  6 Bovidae      Syncerus caffer        6  2012   (35.73392 -3.0035)
#>  7 Elephantidae Loxodonta africana     6  2012 (35.61116 -3.004654)
#>  8 Elephantidae Loxodonta africana     6  2012 (35.97219 -3.429279)
#>  9 Bovidae      Syncerus caffer        6  2012 (35.12084 -1.948731)
#> 10 Elephantidae Loxodonta africana     6  2012 (35.03655 -1.983431)
#> # ℹ 6,941 more rows
#
# #2
str(species)
#> sf [6,951 × 5] (S3: sf/tbl_df/tbl/data.frame)
#>  $ family  : chr [1:6951] "Elephantidae" "Bovidae" "Bovidae" "Felidae" ...
#>  $ species : chr [1:6951] "Loxodonta africana" "Syncerus caffer" "Syncerus caffer" "Panthera pardus" ...
#>  $ month   : num [1:6951] 6 6 6 6 6 6 6 6 6 6 ...
#>  $ year    : num [1:6951] 2012 2012 2012 2012 2012 ...
#>  $ geometry:sfc_POINT of length 6951; first list element:  'XY' num [1:2] 34.09 -2.28
#>  - attr(*, "sf_column")= chr "geometry"
#>  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA
#>   ..- attr(*, "names")= chr [1:4] "family" "species" "month" "year"
#
# #2
plot(st_geometry(species), pch = 20, cex = 0.5)

In #1, we use our first function from sf, which is st_as_sf, which is used to “convert foreign object to an sf object” (?st_as_sf). In this case we coerce a data.frame/tibble into an sf POINT object, by specifying the x and y coordinates for the “coords” argument. The summary of species, now an sf object, displays metadata describing the number of features and variables, the geometry type, the number of dimensions, the object’s bounding box (bbox), its spatial reference system information (EPSG string and proj4string; more on these later), which are currently not set, and then it shows the tibble with values. x and y no longer occur, and have been replaced by a geometry column now.

Looking at str in #2, we see the object is constructed of 4 classes (sf, tbl_df, tbl, and data.frame), with the geometry variable being of class sfc_POINT. Please refer back to the how simple features are organized section on the sf website to understand how the geometry is structured.

We then plot the new sf object, using st_geometry to strip out the geometries of this object, dropping the other variables in the tibble with associated with the object. The reason for this is that plot (sf’s generic for plotting sf objects) would otherwise default to creating one plot panel for each variable in the tibble. We can see this behavior here by comparison:

# Chunk 4
## With st_geometry
plot(species %>% slice(1:10) %>% st_geometry(), pch = 20, cex = 0.5)

## No st_geometry
plot(species %>% slice(1:10), pch = 20, cex = 0.5)

In the above, we illustrate this multi-panel plotting behavior using just the first 10 rows of species, using dplyr syntax to slice out those rows, in the process illustrating that sf is designed to work with the tidyverse.

So, in Chunk 3 above, we have seen how we can coerce the non-spatial species into a spatial object sf. We can coerce species back to an ordinary data.frame/tibble by simply doing this:

# Chunk 5
species %>% as_tibble
#> # A tibble: 6,951 × 5
#>    family       species            month  year             geometry
#>    <chr>        <chr>              <dbl> <dbl>              <POINT>
#>  1 Elephantidae Loxodonta africana     6  2012 (34.08678 -2.276964)
#>  2 Bovidae      Syncerus caffer        6  2012   (35.75882 -3.5614)
#>  3 Bovidae      Syncerus caffer        6  2012 (35.44965 -3.230564)
#>  4 Felidae      Panthera pardus        6  2012 (34.96447 -2.475424)
#>  5 Elephantidae Loxodonta africana     6  2012 (35.13262 -1.883995)
#>  6 Bovidae      Syncerus caffer        6  2012   (35.73392 -3.0035)
#>  7 Elephantidae Loxodonta africana     6  2012 (35.61116 -3.004654)
#>  8 Elephantidae Loxodonta africana     6  2012 (35.97219 -3.429279)
#>  9 Bovidae      Syncerus caffer        6  2012 (35.12084 -1.948731)
#> 10 Elephantidae Loxodonta africana     6  2012 (35.03655 -1.983431)
#> # ℹ 6,941 more rows

This gets us back to a tibble, but the x and y variables do not reappear, instead we still have the geometry column. If we want to get back to the original structure of the tibble, its more complicated, but certainly possible:

# Chunk 6
bind_cols(species %>% st_drop_geometry(), 
          st_coordinates(species) %>% as_tibble()) %>% 
  select(family, species, month, year, X, Y) %>% 
  rename(x = X, y = Y)
#> # A tibble: 6,951 × 6
#>    family       species            month  year     x     y
#>    <chr>        <chr>              <dbl> <dbl> <dbl> <dbl>
#>  1 Elephantidae Loxodonta africana     6  2012  34.1 -2.28
#>  2 Bovidae      Syncerus caffer        6  2012  35.8 -3.56
#>  3 Bovidae      Syncerus caffer        6  2012  35.4 -3.23
#>  4 Felidae      Panthera pardus        6  2012  35.0 -2.48
#>  5 Elephantidae Loxodonta africana     6  2012  35.1 -1.88
#>  6 Bovidae      Syncerus caffer        6  2012  35.7 -3.00
#>  7 Elephantidae Loxodonta africana     6  2012  35.6 -3.00
#>  8 Elephantidae Loxodonta africana     6  2012  36.0 -3.43
#>  9 Bovidae      Syncerus caffer        6  2012  35.1 -1.95
#> 10 Elephantidae Loxodonta africana     6  2012  35.0 -1.98
#> # ℹ 6,941 more rows

We drop the geometry of species, and then we use dplyr::bind_cols (equivalent to cbind) to bind the coordinates back to the tibble. The coordinates are obtained using st_coordinates, an sf function used to “retrieve coordinates in matrix form” (?st_coordinate), and we have to coerce that to a tibble for bind_cols to work. We can use select to reorder the columns in their original form, and rename the coordinate columns back to their original lower case (st_coordinates returns X and Y columns). This is a standard way for geoscientists to convert spatial data back and forth.

3.3 Read and write spatial data

That was a very quick dip into spatial waters, using a points example. We are going to learn to read and write (in reverse order) spatial data. We still have species back as an sf object. We are now going to write it out to a spatial file.

# Chunk 6
st_write(obj = species, dsn = file.path(tempdir(), "species.shp"), 
         delete_dsn = TRUE)
#> writing: substituting ENGCRS["Undefined Cartesian SRS with unknown unit"] for missing CRS
#> Warning in CPL_write_ogr(obj, dsn, layer, driver, as.character(dataset_options), : GDAL Error 1:
#> /var/folders/8z/trk9m2ks7ksd2s963fkz3dg80000gq/T//RtmpXrss7r/species.shp does not appear to be a
#> file or directory.
#> Deleting source `/var/folders/8z/trk9m2ks7ksd2s963fkz3dg80000gq/T//RtmpXrss7r/species.shp' failed
#> Writing layer `species' to data source 
#>   `/var/folders/8z/trk9m2ks7ksd2s963fkz3dg80000gq/T//RtmpXrss7r/species.shp' using driver `ESRI Shapefile'
#> Writing 6951 features with 4 fields and geometry type Point.
dir(tempdir(), pattern = "species")
#> [1] "species.dbf" "species.prj" "species.shp" "species.shx"

The code above uses st_write to create that old standby, the ESRI shapefile, that clunky, annoying format that has many files for one thing. Note that I have the option delete_dsn = TRUE set, which is needed if the file already exists in that location, and you want to recreate it (which I did in this case, because I was re-running this code many times as I wrote this). For st_write, the “obj” argument asks for the sf object to write, “dsn” is the name and path for the output file. By adding the “.shp” at the end of the filename, st_write knows to use the “ESRI Shapefile” driver. For this example, I used again the tempdir() function to write this object to a temporary directory, but, as you follow along, you should change the file path to correctly lead to your own xyz320/notebooks/data path.

Looking in the output directory here, we see that there are only three files now (.dbf, .shp, and .shx). We are missing a .prj because of the following reason:

# Chunk 7
species %>% st_crs()
#> Coordinate Reference System: NA

species does not have a coordinate reference system (CRS) associated with it yet, even though we can tell it is in geographic coordinates. Remember this, because we will come back to it.

Because .shp is an annoying format, we can try other types. Let’s do the much more convenient “geojson” format, which produces a single file.

# Chunk 8
st_write(obj = species, dsn = file.path(tempdir(), "species.geojson"), 
         delete_dsn = TRUE)
#> writing: substituting ENGCRS["Undefined Cartesian SRS with unknown unit"] for missing CRS
#> Deleting source `/var/folders/8z/trk9m2ks7ksd2s963fkz3dg80000gq/T//RtmpXrss7r/species.geojson' failed
#> Writing layer `species' to data source 
#>   `/var/folders/8z/trk9m2ks7ksd2s963fkz3dg80000gq/T//RtmpXrss7r/species.geojson' using driver `GeoJSON'
#> Writing 6951 features with 4 fields and geometry type Point.
dir(tempdir(), pattern = "species")
#> [1] "species.dbf"     "species.geojson" "species.prj"     "species.shp"     "species.shx"

We see that we have just a single file to hold the .geojson, which is much nicer, and is an increasingly widely used file format, so you should use it whenever possible in place of .shp. However, we will stick with ESRI shapefiles for this class, since they are so widely used still. So let’s read back in from the shapefile itself. I am going the first delete the .geojson version, for the sake of directory cleanliness (having shown you a nicer format, I am not going to work with because so much is still done with

# Chunk 9
file.remove(dir(tempdir(), pattern = "species.geojson", full.names = TRUE))
#> [1] TRUE
rm(species) 
species <- st_read(dir(tempdir(), pattern = "species.shp", full.names = TRUE))
#> Reading layer `species' from data source 
#>   `/private/var/folders/8z/trk9m2ks7ksd2s963fkz3dg80000gq/T/RtmpXrss7r/species.shp' 
#>   using driver `ESRI Shapefile'
#> Simple feature collection with 6951 features and 4 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 29.91544 ymin: -11.09595 xmax: 38.79323 ymax: -1.44546
#> Projected CRS: Undefined Cartesian SRS with unknown unit

We used file.remove to get rid of the .sqlite, and then used rm to get rid of the existing species sf object, and then read back in the species we had written to the .shp. So we have species back again.

Now, there are two more datasets that come with sdadata to read in using st_read: roads.geojson and districts.geojson. The paths to both are found with the system.file function. I’ll set up the first one for you.

# Chunk 10
fnm <- system.file("extdata/roads.geojson", package = "sdadata")
roads <- st_read(dsn = fnm) #%>% st_set_crs("ESRI:102022")
#> Reading layer `roads' from data source 
#>   `/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/sdadata/extdata/roads.geojson' 
#>   using driver `GeoJSON'
#> Simple feature collection with 4569 features and 1 field
#> Geometry type: MULTILINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: 478749.8 ymin: -1385619 xmax: 1591588 ymax: -118296.2
#> Projected CRS: Africa_Albers_Equal_Area_Conic
fnm <- system.file("extdata/districts.geojson", package = "sdadata")
districts <- st_read(dsn = fnm)
#> Reading layer `districts' from data source 
#>   `/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/sdadata/extdata/districts.geojson' 
#>   using driver `GeoJSON'
#> Simple feature collection with 170 features and 1 field
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 29.58953 ymin: -11.76235 xmax: 40.44473 ymax: -0.983143
#> Geodetic CRS:  WGS 84

Let’s end this section by a quick look at these data. We are going to plot districts, roads, and species onto one plot.

# Chunk 11
par(mar = c(0, 0, 0, 0))
plot(st_geometry(districts), col = "grey")
plot(st_geometry(roads), col = "red", add = TRUE)
plot(st_geometry(species), pch = 20, reset = FALSE, col = "blue", add = TRUE)

So what have we done? First, we use the par function with the mar (margin) argument and a 4-element vector of 0s to remove the inner margins of the plot so that the map of Tanzania fills the plot frame. We then use plot to draw the map of districts (wrapped in st_geometry, to just grab the geometry part), which are administrative boundaries for Tanzania. We then use lines to add roads on top of that, making them red. Finally, we add the species data as blue points. Note the use of add = TRUE, which is necessary to overlay the two plots on the district boundaries without creating a new plot of each.

That’s great, you say, but I see no roads in the map. Where are they? And why are there two species outside of Tanzania?

We’ll look at that in our next section, but a hint of why can be seen in the outputs from Chunk 10. You will note that in Chunks 9 and 10 that st_read (the opposite of st_write) gives a brief summary of the spatial data being read into R.

3.4 Coordinate Reference Systems and Projections

The reason you can’t see the roads on the previous map can be illustrated as follows:

# Chunk 12
sfdat <- list("districts" = districts,  "roads" = roads, "species" = species)
sapply(sfdat, function(x) st_bbox(x))
#>       districts      roads   species
#> xmin  29.589529   478749.8  29.91544
#> ymin -11.762349 -1385619.0 -11.09595
#> xmax  40.444735  1591587.5  38.79323
#> ymax  -0.983143  -118296.2  -1.44546

In the code above, we add our 3 sf objects into a list (sfdat), and then use sapply to apply st_bbox to each list element to get their bounding boxes. That produces an output matrix that shows in the westernmost (xmin), southernmost (ymin), easternmost (xmax), and northernmost (ymax) coordinates of each object. This shows us that roads’ coordinates are in different units (meters) than districts’ and species’, which are decimal degrees. So when we tried to plot roads on the same map as districts and species, roads didn’t appear because the units are not the same. You could get a map of roads if you plotted them on their own.

3.4.1 Checking the CRS of an sf object

In this case, we have to do a bit more, which is to convert roads coordinate reference system (CRS) to decimal degrees. How do we do that? First, we have to know the crs we want to transform roads into, which we can do by checking the st_crs of species or districts:

# Chunk 13
st_crs(districts)$proj4string
#> [1] "+proj=longlat +datum=WGS84 +no_defs"
st_crs(roads)$proj4string
#> [1] "+proj=aea +lat_0=0 +lon_0=25 +lat_1=20 +lat_2=-23 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs"

We use st_crs, which is sf’s function for checking and assigning a CRS to an sf object. When you run it as we have done with st_crs(districts), it returns an object of class crs (you can verify that by running class(st_crs(districts))). We see that districts and roads have different values in their “PROJCRS” fields, which holds the projection string that contains the projection parameters of the CRS. If you want to understand the meaning of projection string parameters, please have a look at the documentation for the proj.4 library, specifically the page on parameters.

This one basically tells us that districts has a geographic (longlat) projection, and uses the WGS84 datum. Let’s check the crs for all three objects now, using our old friend sapply

# Chunk 14
sapply(sfdat, function(x) st_crs(x)$proj4string)
#>                                                                                        districts 
#>                                                            "+proj=longlat +datum=WGS84 +no_defs" 
#>                                                                                            roads 
#> "+proj=aea +lat_0=0 +lon_0=25 +lat_1=20 +lat_2=-23 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs" 
#>                                                                                          species 
#>                                                                                               NA

These results show us that species does not have a projection string. So how come it was able to plot on the same map as districts? Because the coordinates are in the same units. However, we want to fix that before we move to the next step, reprojecting:

# Chunk 15
st_crs(species) <- st_crs(districts)
#> Warning: st_crs<- : replacing crs does not reproject data; use st_transform for that
st_crs(sfdat$species) <- st_crs(districts)
#> Warning: st_crs<- : replacing crs does not reproject data; use st_transform for that
st_crs(sfdat$species)$proj4string
#> [1] "+proj=longlat +datum=WGS84 +no_defs"

Simple–we apply the fix to species and to sfdat$species by assigning each the proj4string from districts. We can do that because we are pretty sure that the coordinates for species are from a geographic coordinate system that uses the WGS84 datum. If that is incorrect, we are introducing some spatial error. But for now, this is fine.

Another note, when running st_crs, we are extracting the proj4string element from the resulting crs object. That is because it provides a more compact string representation of the CRS information. Running st_crs(sfdat$species) produces a much lengthier (and more informative) output in wkt format. Go ahead and try that.

3.4.2 Reprojecting

Okay, so now we want to change roads to the same CRS as districts.

# Chunk 16
sfdat$roads <- st_transform(x = sfdat$roads, crs = st_crs(districts))

We use sf‘s st_transform function to reproject sfdat$roads using the parameters supplied by districts’ p4s, which is passed into the function’s “crs” argument via the st_crs function applied to districts.

And that’s all. We can now redo our plots. I am going to show you two ways to do it. First, the way we have already done it with sf:plot:

# Chunk 17
par(mar = c(0, 0, 0, 0))
plot(sfdat$districts %>% st_geometry(), col = "grey")
plot(sfdat$roads %>% st_geometry(), col = "red", add = TRUE)
plot(sfdat$species %>% st_geometry(), col = "blue", pch = 20, add = TRUE)

And with ggplot:

# Chunk 18
ggplot(sfdat$districts) + geom_sf() +
  geom_sf(data = sfdat$roads, col = "red") + 
  geom_sf(data = sfdat$species, col = "blue")

So now you see the roads appearing as red lines on the map. ggplot adds a coordinate grid to the outside of the plot, and has a special function geom_sf that is designed to plot sf geometries. However, it is still substantially slower than sf::plot for the time being, so we’ll mostly stick with the latter.

3.5 Making simple features from scratch

Before ending this section, it is worth looking at how we can make our own sf objects from scratch, which we might need to do from time to time.

# Chunk 19
# #1
pt <- st_point(x = c(34, -6))
pt
#> POINT (34 -6)
#
# #2
pts <- st_multipoint(x = cbind(x = c(35.5, 36, 36.5), y = c(-5, -5.5, -6)))
pts
#> MULTIPOINT ((35.5 -5), (36 -5.5), (36.5 -6))
#
# #3
sline <- st_linestring(cbind(x = c(35, 35.5, 36), y = c(-5.5, -6, -6.5)))
sline
#> LINESTRING (35 -5.5, 35.5 -6, 36 -6.5)
#
# #4
pol <- st_polygon(list(cbind(x = c(34.5, 35.5, 35, 34, 34.5), 
                             y = c(-6, -7, -7.5, -6.5, -6))))
pol
#> POLYGON ((34.5 -6, 35.5 -7, 35 -7.5, 34 -6.5, 34.5 -6))
#
# 5
par(mar = c(0, 0, 0, 0))
plot(districts %>% st_geometry(), col = "grey")
plot(pt, add = TRUE, col = "red", pch = 20)
plot(pts, add = TRUE, col = "blue", pch = 20)
plot(sline, add = TRUE, col = "cyan", lwd = 2)
plot(pol, add = TRUE, col = "orange", lwd = 2)

We use st_point in #1 to create a single point object, based on a two-element vector that consists of the x and y coordinate. In #2, st_multipoint create an object with three points, based on an matrix containing the x coordinates in the first column and y coordinates in second column. #3 convert a similar matrix of coordinates into an st_linestring, and #4 creates a polygon object, also using a matrix as input–note that the first and last coordinates in both the x and y coordinates are the same, which is done in order to close the polygon. The printout of each sfg (simple feature geometry) shows that they store coordinates as point pairs separated by commas (except for pt, since it is just a single point). In #5 we plot each object over the districts polygons.

We’ll go bit further in making new features in the section on spatial operations

3.6 Practice

3.6.1 Questions

  1. What are the primary simple feature geometry types?

  2. How do you convert a tibble (or data.frame) into an sf object?

  3. When plotting (mapping) sf objects, what happens if your object has multiple variables?

  4. When constructing new sf objects, in what format do you have to provide inputs to the st_* constructor functions. How does the input object differ in the case of a POLYGON versus MULTIPOINT or LINESTRING?

3.6.2 Code

  1. Working with the species sf object, convert it to just a set of points without any other variables.

  2. Write out species with it CRS set to the same as districts into an sqlite file within your notebooks/data folder. Remove the species object from your workspace, and then read back into R the sqlite you just wrote out.

  3. Examine the class and str of districts and roads.

  4. Create a plot of roads–just the geometries–and make the color of the lines blue.

  5. Create a plot of districts, using dplyr::select to choose the distName variable from the objects data.frame. Set the title to “Tanzania Districts” using the “main” argument. Don’t specify a color.

  6. Create an st_multipoint object using x coordinates of c(35, 36, 37) and y coordinates of c(-5, -6, -7), and plot that over districts as orange points.

4 Spatial properties and attribute fields

We now begin to learn how to perform analyses with spatial vectors, starting with calculating basic spatial properties and manipulating the attribute tables associated with spatial vectors.

As you probably closed this project between Section 1 and 2, you will also need to reload the roads, districts, and species datasets (coercing the latter back to an sf object). The code in Section 1 will show you how to do that, but here is some code to recreate most of what we need:

# Chunk 20
roads <- system.file("extdata/roads.geojson", package = "sdadata") %>% st_read()
#> Reading layer `roads' from data source 
#>   `/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/sdadata/extdata/roads.geojson' 
#>   using driver `GeoJSON'
#> Simple feature collection with 4569 features and 1 field
#> Geometry type: MULTILINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: 478749.8 ymin: -1385619 xmax: 1591588 ymax: -118296.2
#> Projected CRS: Africa_Albers_Equal_Area_Conic
districts <- system.file("extdata/districts.geojson", package = "sdadata") %>% 
  st_read()
#> Reading layer `districts' from data source 
#>   `/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/sdadata/extdata/districts.geojson' 
#>   using driver `GeoJSON'
#> Simple feature collection with 170 features and 1 field
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 29.58953 ymin: -11.76235 xmax: 40.44473 ymax: -0.983143
#> Geodetic CRS:  WGS 84
species <- system.file("extdata/species.csv", package = "sdadata") %>% 
  read_csv() %>% st_as_sf(coords = c("x", "y"), crs = 4326)
#> Rows: 6951 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): family, species
#> dbl (4): month, year, x, y
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

This looks a bit different than what we did before. In this construction, we make use of dplyr pipes to do everything in one go for all three datasets. We pipe the file path to roads.geojson straight to sf::st_read(), and that reads in it as a one-liner. Same goes for districts (despite the line break). species gets the file path piped to read_csv, and then the coercion to sf happens right downstream. We add the crs = 4326 argument to st_as_sf to provide the missing projection information (4326 is the EPSG identifier for GCS WGS84)

4.1 Spatial properties

First, due to recent changes in sf, and how it handles spherical geometries, we have to set the following:

sf_use_s2(FALSE)
#> Spherical geometry (s2) switched off

Otherwise if we try to calculate lengths or create buffers of vectors that are in geographic coordinates systems, we get odd results.

4.1.1 Area

Using polygons to calculate area is a fairly basic but important spatial vector operation. To calculate meaningful areas, generally you need your data to be projected into a coordinate system in which area is one of the true properties. However, the function we are going to use, sf::st_area, will estimate area even for data in geographic coordinates.

# Chunk 21
# #1
dist_areas <- districts %>% st_area()
dist_areas
#
# #2
dist_areas %>% units::set_units("ha")
#
# #3
dist_areas %>% units::set_units("km^2")
#> Units: [m^2]
#>   [1]  1237181424   266991880  3237796747  8064122142  1265947533  6981462233 15909001482
#>   [8]   364969525   538906808   728705613  5630896694  9132303284  7289672419  2607595710
#>  [15]  5792245985  3958699261  7455226936  8055578543  3074286510  4707386048  2280950383
#>  [22]  1606578879 18784745856   369109423  9243791258   572527986  7514438395  7463660843
#>  [29]  1617874960    83240902  5133730982  2506182100  2824869889  3502512919  3305058442
#>  [36]   231069002   295025066   229584392   234454737 29625841684 16941638272   318438971
#>  [43]  1506332304  2208746511  7195183068   911823145 13406516414   967687278    92676984
#>  [50] 10054389973   902810020  1345846714    63391414  1967428180  1470230691  6220454054
#>  [57]  1216778726   219107154   256032980   505758018   379464273 15000069365  5974912721
#>  [64]  1063850901 34313758593  5973961212  2515594227  5569089606   470998471  3674639527
#>  [71] 13131616674  3799582883 19817537139  3092906601  2167876971  1069975601    64836217
#>  [78]  2001857930 11157052367  1534469486 17506256581   755895744 14439000656  2812972831
#>  [85]   252548054  2155115205   217763048    15456587  1971609350 13545763762 11774234334
#>  [92] 12458010586   288349054  6631589820 23461563741  4005226578   753345006  3692082480
#>  [99]   169900954  5203426450  1952675671  2048455953   254659437  3254313104  1517275875
#> [106]  2579364563   182725205  3152995098   628109329  5952783983   883468899  3996387124
#> [113]  3454262122  3536074287  3437196007  8455502522  1499486177   705286644  5031222977
#> [120]   642661933  2827362961 12747274544  5569504313  9824247730  5150341159  1373838352
#> [127]  4839969595 21765250729  3062559801 14045064690   578294562 18413517705  6754668703
#> [134]  1363202971  4182361771  3568623180   554679096  5484536065  1327969684  4689877393
#> [141]  3666986218  9018397768  8860767776  4520782523 28933748592  3120855417  2399966691
#> [148]   720561356  1880346226  3857652254  4708755744 11164654665    87466772  7064482800
#> [155] 15336076188  7864284875 26279536320  1460815874  5415791391 11967050744  6529182188
#> [162]   837413267  6433087358  3203142195   225328280  4091625158  2711929790  1497926427
#> [169]  1763138596   596500408
#> Units: [ha]
#>   [1]  123718.142   26699.188  323779.675  806412.214  126594.753  698146.223 1590900.148
#>   [8]   36496.952   53890.681   72870.561  563089.669  913230.328  728967.242  260759.571
#>  [15]  579224.599  395869.926  745522.694  805557.854  307428.651  470738.605  228095.038
#>  [22]  160657.888 1878474.586   36910.942  924379.126   57252.799  751443.839  746366.084
#>  [29]  161787.496    8324.090  513373.098  250618.210  282486.989  350251.292  330505.844
#>  [36]   23106.900   29502.507   22958.439   23445.474 2962584.168 1694163.827   31843.897
#>  [43]  150633.230  220874.651  719518.307   91182.315 1340651.641   96768.728    9267.698
#>  [50] 1005438.997   90281.002  134584.671    6339.141  196742.818  147023.069  622045.405
#>  [57]  121677.873   21910.715   25603.298   50575.802   37946.427 1500006.937  597491.272
#>  [64]  106385.090 3431375.859  597396.121  251559.423  556908.961   47099.847  367463.953
#>  [71] 1313161.667  379958.288 1981753.714  309290.660  216787.697  106997.560    6483.622
#>  [78]  200185.793 1115705.237  153446.949 1750625.658   75589.574 1443900.066  281297.283
#>  [85]   25254.805  215511.521   21776.305    1545.659  197160.935 1354576.376 1177423.433
#>  [92] 1245801.059   28834.905  663158.982 2346156.374  400522.658   75334.501  369208.248
#>  [99]   16990.095  520342.645  195267.567  204845.595   25465.944  325431.310  151727.588
#> [106]  257936.456   18272.520  315299.510   62810.933  595278.398   88346.890  399638.712
#> [113]  345426.212  353607.429  343719.601  845550.252  149948.618   70528.664  503122.298
#> [120]   64266.193  282736.296 1274727.454  556950.431  982424.773  515034.116  137383.835
#> [127]  483996.960 2176525.073  306255.980 1404506.469   57829.456 1841351.771  675466.870
#> [134]  136320.297  418236.177  356862.318   55467.910  548453.606  132796.968  468987.739
#> [141]  366698.622  901839.777  886076.778  452078.252 2893374.859  312085.542  239996.669
#> [148]   72056.136  188034.623  385765.225  470875.574 1116465.466    8746.677  706448.280
#> [155] 1533607.619  786428.488 2627953.632  146081.587  541579.139 1196705.074  652918.219
#> [162]   83741.327  643308.736  320314.220   22532.828  409162.516  271192.979  149792.643
#> [169]  176313.860   59650.041
#> Units: [km^2]
#>   [1]  1237.18142   266.99188  3237.79675  8064.12214  1265.94753  6981.46223 15909.00148
#>   [8]   364.96952   538.90681   728.70561  5630.89669  9132.30328  7289.67242  2607.59571
#>  [15]  5792.24599  3958.69926  7455.22694  8055.57854  3074.28651  4707.38605  2280.95038
#>  [22]  1606.57888 18784.74586   369.10942  9243.79126   572.52799  7514.43839  7463.66084
#>  [29]  1617.87496    83.24090  5133.73098  2506.18210  2824.86989  3502.51292  3305.05844
#>  [36]   231.06900   295.02507   229.58439   234.45474 29625.84168 16941.63827   318.43897
#>  [43]  1506.33230  2208.74651  7195.18307   911.82315 13406.51641   967.68728    92.67698
#>  [50] 10054.38997   902.81002  1345.84671    63.39141  1967.42818  1470.23069  6220.45405
#>  [57]  1216.77873   219.10715   256.03298   505.75802   379.46427 15000.06937  5974.91272
#>  [64]  1063.85090 34313.75859  5973.96121  2515.59423  5569.08961   470.99847  3674.63953
#>  [71] 13131.61667  3799.58288 19817.53714  3092.90660  2167.87697  1069.97560    64.83622
#>  [78]  2001.85793 11157.05237  1534.46949 17506.25658   755.89574 14439.00066  2812.97283
#>  [85]   252.54805  2155.11521   217.76305    15.45659  1971.60935 13545.76376 11774.23433
#>  [92] 12458.01059   288.34905  6631.58982 23461.56374  4005.22658   753.34501  3692.08248
#>  [99]   169.90095  5203.42645  1952.67567  2048.45595   254.65944  3254.31310  1517.27588
#> [106]  2579.36456   182.72520  3152.99510   628.10933  5952.78398   883.46890  3996.38712
#> [113]  3454.26212  3536.07429  3437.19601  8455.50252  1499.48618   705.28664  5031.22298
#> [120]   642.66193  2827.36296 12747.27454  5569.50431  9824.24773  5150.34116  1373.83835
#> [127]  4839.96960 21765.25073  3062.55980 14045.06469   578.29456 18413.51771  6754.66870
#> [134]  1363.20297  4182.36177  3568.62318   554.67910  5484.53606  1327.96968  4689.87739
#> [141]  3666.98622  9018.39777  8860.76778  4520.78252 28933.74859  3120.85542  2399.96669
#> [148]   720.56136  1880.34623  3857.65225  4708.75574 11164.65466    87.46677  7064.48280
#> [155] 15336.07619  7864.28488 26279.53632  1460.81587  5415.79139 11967.05074  6529.18219
#> [162]   837.41327  6433.08736  3203.14220   225.32828  4091.62516  2711.92979  1497.92643
#> [169]  1763.13860   596.50041

Very simple to calculate area for each polygon in districts. Note that the default is to calculate m\(^2\), but we can change that to hectares (in #2) using or km\(^2\) (#3) using the function units::set_units (the units package is imported by sf). Note that only numbers with units can be operated together (e.g. >|+...). This is sometimes annoying, but safe.

So this was an estimate of area from a non-project dataset. How does that compare to area from a true equal area projection, such as the Albers Equal Area projection that roads is projected in:

# Chunk 21
dist_areas_alb <- districts %>% 
  st_transform(crs = st_crs(roads)) %>% 
  st_area()
absd <- mean(abs(dist_areas_alb - dist_areas)) %>% units::set_units("ha")
pctd <- as.numeric(absd / (mean(dist_areas_alb) / 10000)) * 100 
absd
#> 0.1048738 [ha]
pctd
#> [1] 2.017064e-05

Using st_transform with the CRS from roads to reproject district, and then calculate polygons areas, we see the mean absolute difference between areas calculated from the projected and unprojected datasets is around 0 ha, which is a tiny percentage (0%) of the average district area. Not much to write home about, but also not nothing if absolute accuracy of area calculations is your interest.

# Chunk 22
districts %>% 
  st_area() %>% units::set_units("km^2") %>% 
  as.numeric() %>% 
  tibble(km2 = .) %>% 
  ggplot() + 
  geom_histogram(aes(x = km2), bins = 15, fill = "blue", col = "grey") + 
  xlab(bquote("km"^2))

Here’s a look at the district areas (in km\(^2\)) as a histogram. Notice the conversion steps. First we wanted to convert the calculated areas to a numeric vector (rather than one of class units), using as.numeric, then we made it into a tibble, because ggplot only works with data.frame, and then we plotted the histogram. There is one new addition there–in xlab, we use bquote to print a superscript 2 in the km\(^2\) axis label.

4.1.2 Distance

We can also calculate the distances between spatial features. To do that, we are make use of st_centroid on the Albers-projected districts data, which we use to find the central coordinates of each district, and then plot the centroids on top of the districts themselves.

# Chunk 23
# # 1
dist_cent <- districts %>% 
  st_transform(., st_crs(roads)) %>% 
  st_centroid()
#> Warning: st_centroid assumes attributes are constant over geometries
# dist_cent <- st_centroid(st_transform(districts, st_crs(roads)))  # equivalent

# #2
par(mar = c(0, 0, 0, 0))
districts %>% 
  st_transform(., st_crs(roads)) %>% 
  st_geometry() %>% 
  plot(col = "grey")
dist_cent %>% 
  st_geometry() %>% 
  plot(col = "red", pch = 20, add = TRUE)

Notice that right below the centroid calculation pipeline (#1) is a commented out line, which is the non-pipeline equivalent of the operation. In #2, we make the plots, setting up the plotting functions for both the districts polygons and their centroids on the downstream ends of pipelines.

Moving on, we can now find out how far apart the centroids of each district are by using another sf function, st_distance

# Chunk 24
# #1
all_dists <- st_distance(x = dist_cent)
all_dists[1:5, 1:5]
#> Units: [m]
#>           1         2        3         4         5
#> 1      0.00  10773.52 131963.4  76784.10  22154.92
#> 2  10773.52      0.00 129079.6  86863.41  26309.87
#> 3 131963.35 129079.64      0.0 147658.87 153955.95
#> 4  76784.10  86863.41 147658.9      0.00  84144.27
#> 5  22154.92  26309.87 153956.0  84144.27      0.00
#
# #2
d1_dists <- st_distance(x = dist_cent[1, ], y = dist_cent[-1, ])
d1_dists %>% as.vector()
#>   [1]  10773.52 131963.35  76784.10  22154.92  48224.76 147078.03 495663.62 474665.43 514089.82
#>  [10] 353886.60 362278.55 250294.34 340275.66 184677.86 316956.16 406549.64 527641.50 521549.31
#>  [19] 470564.45 468628.64 423168.10 516170.47 529745.02 529970.53 612782.56 628895.99 566686.88
#>  [28] 575019.66 560010.92 609614.35 659079.91 605944.44 562199.71 624234.34 373298.61 380520.11
#>  [37] 402797.82 410925.40 640600.59 700882.58 692258.56 709156.11 599117.31 662530.06 697844.47
#>  [46] 598375.94 736838.39 749211.47 714257.32  53531.44  76012.40  67721.81 108887.05  88347.56
#>  [55] 162428.80  52353.39 390902.41 394763.98 434482.54 458452.81 723316.19 841413.35 839203.18
#>  [64] 719158.31 849623.83 827749.66 124030.35 140713.62 200709.66 223305.43 161159.43 122529.49
#>  [73] 327921.31 343648.81 366292.29 369469.86 363178.41 257102.36 326563.80 618376.01 787743.39
#>  [82] 638355.70 743121.93 736326.31 762474.50 427043.00 424225.69 340853.12 598637.28 431458.14
#>  [91] 471510.98 419398.57 386654.23 675532.75 910230.56 901426.95 910836.04 898345.07 925203.27
#> [100] 911874.62 920602.72 397994.98 353407.25 351071.53 383560.81 397869.36 440686.55 412550.04
#> [109] 812784.29 681952.75 743611.48 712947.98 753428.93 709107.31 394265.91 454764.07 458753.23
#> [118] 501837.23 622868.53 530822.22 580031.95 809714.84 770461.52 758146.10 759308.41 895288.77
#> [127] 862583.81 948485.07 845917.06 870659.86 901352.65 450843.31 427350.51 305811.30 373118.50
#> [136] 337148.28 262813.88 321876.85 237240.89 300375.88 229361.05 308646.42 279916.68 430791.06
#> [145] 227841.42 240168.50 269285.41 797281.67 770562.91 784042.08 667158.50 811102.94 332184.61
#> [154] 546569.30 399846.38 489949.70 447580.36 523652.09 408873.43 311721.24 285328.88 272796.22
#> [163] 251161.26 284139.82 230940.61 285163.28 305912.46 346104.37 319510.29
#
# #3
d1_dists %>% 
  as.vector() %>% 
  quantile() / 1000
#>        0%       25%       50%       75%      100% 
#>  10.77352 321.87685 450.84331 692.25856 948.48507

There are two variants above. The first (#1) is captured in all_dists, and it produces a matrix that provides the distance, in meters, between each district centroid and every other district centroid (we show the first 5 rows and 5 columns of the matrix, otherwise it is 72X72). To get that, you simply pass in the dist_cent object into to the “x” argument. The second variant (#2) (d1_dists), uses both “x” and “y” arguments, and it simply finds the distance between the first district centroid and all the other districts’ centroids. We coerce the matrix to a vector for more compact printing. In #3, we summarize d1_dists using the quantile function, converting meters to kilometers.

4.1.3 Length

For the last example of functions that extract spatial properties, we will calculate the line lengths and perimeters.

# Chunk 25
# #1
## WARNING: when use as.numeric(), keep in mind the original unit
road_length <- st_length(roads) %>% as.numeric()
#
# #2
dist_perims <- st_length(districts) %>% as.numeric()
#
# #3
p1 <- ggplot(tibble(x = road_length), aes(x / 1000)) + 
  geom_histogram(fill = "blue", col = "grey", bins = 15) + xlab("km") + 
  ggtitle("Tanzania road segment lengths")
p2 <- ggplot(tibble(x = dist_perims), aes(x / 1000)) + 
  geom_histogram(fill = "purple", col = "grey", bins = 10) + xlab("km") + 
  ylab("") + ggtitle("Tanzania district perimeters")
cowplot::plot_grid(p1, p2)

We use st_length to calculate the length of the line segments in roads (#1) and the perimeter length of the polygons in district (#2), converting both to numeric vectors. We then plot each set of lengths as histograms (#3), converting units to kilometers (by dividing by 1000, within geom_histogram), and then use cowplot::plot_grid to arrange the two histograms side-by-side.

4.2 Manipulating attributes

sf objects stores attributes as data.frames or tibbles, and they can be manipulated using the same preparatory and analytical methods we learned about in the previous modules. For example, let’s consider some very basic subsetting:

# Chunk 26
par(mar = c(0, 0, 0, 0))
st_geometry(districts) %>% plot(col = "grey")
districts %>% 
  slice(1, 20, 40) %>% 
  plot(col = "blue", add = TRUE)
set.seed(123)
species %>% 
  sample_n(20) %>% 
  st_geometry() %>%  
  plot(col = "yellow", pch = "+", add = TRUE)

Notice the use of dplyr verbs to subset two datasets on the fly before they are piped into plot. The plot as setup with all districts as a backdrop (using st_geometry to strip away the attribute table), and then we overlay on that in blue the 1st, 20th, and 40th district, sliced out of districts. On top of that, we take a random sample of 20 observations from species, and plot those on top. You can also index into the sf objects with []:

# Chunk 27
par(mar = c(0, 0, 0, 0))
plot(st_geometry(districts), col = "grey")
plot(districts[c(1, 20, 40), ], col = "blue", add = TRUE)
set.seed(123)
plot(st_geometry(species[sample(1:nrow(species), 20), ]), col = "yellow", 
     pch = "+", add = TRUE)

The code above makes the identical plot as in Chunk 26, so we don’t display that again, but note the indexing, as well as the lack of piping and more conventional syntax.

4.2.1 Mutating and joining

The Spatial Properties showed how to calculate area, length, and other properties as separate outputs. Why not add them stick them in separate data objects. Now let’s see how to add those properties back the sf attribute tibble.

# Chunk 27
## WARNING: when use as.numeric(), keep in mind the original unit
districts %>% mutate(area = as.numeric(st_area(.) / 10^6))
#> Simple feature collection with 170 features and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 29.58953 ymin: -11.76235 xmax: 40.44473 ymax: -0.983143
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>        distName                       geometry       area
#> 1        Arusha MULTIPOLYGON (((36.82231 -3...  1237.1814
#> 2  Arusha Urban MULTIPOLYGON (((36.63247 -3...   266.9919
#> 3        Karatu MULTIPOLYGON (((35.89154 -3...  3237.7967
#> 4       Longido MULTIPOLYGON (((36.35742 -3...  8064.1221
#> 5          Meru MULTIPOLYGON (((36.89951 -3...  1265.9475
#> 6       Monduli MULTIPOLYGON (((36.54092 -3...  6981.4622
#> 7    Ngorongoro MULTIPOLYGON (((35.32059 -1... 15909.0015
#> 8         Ilala MULTIPOLYGON (((39.21843 -6...   364.9695
#> 9     Kinondoni MULTIPOLYGON (((39.27007 -6...   538.9068
#> 10       Temeke MULTIPOLYGON (((39.3143 -6....   728.7056

So that’s a fairly straightforward way to do it. Just use mutate with st_area (note that st_area needs the . in it to function correctly). In making a new area column, we divide by 10\(^6\) to convert to km\(^2\), and coerce the result to numeric.

Imagine that someone gave you a separate dataset that simply listed district names and their areas for Tanzania, but provided no other spatial information. Say they did something like this:

# Chunk 28
dist_area <- tibble(distName = districts$distName, 
                    area = as.numeric(st_area(districts)) / 10^6)

We could use a *_join to merge this onto districts:

# Chunk 29
left_join(districts, dist_area, by = "distName")
#> Simple feature collection with 170 features and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 29.58953 ymin: -11.76235 xmax: 40.44473 ymax: -0.983143
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>        distName       area                       geometry
#> 1        Arusha  1237.1814 MULTIPOLYGON (((36.82231 -3...
#> 2  Arusha Urban   266.9919 MULTIPOLYGON (((36.63247 -3...
#> 3        Karatu  3237.7967 MULTIPOLYGON (((35.89154 -3...
#> 4       Longido  8064.1221 MULTIPOLYGON (((36.35742 -3...
#> 5          Meru  1265.9475 MULTIPOLYGON (((36.89951 -3...
#> 6       Monduli  6981.4622 MULTIPOLYGON (((36.54092 -3...
#> 7    Ngorongoro 15909.0015 MULTIPOLYGON (((35.32059 -1...
#> 8         Ilala   364.9695 MULTIPOLYGON (((39.21843 -6...
#> 9     Kinondoni   538.9068 MULTIPOLYGON (((39.27007 -6...
#> 10       Temeke   728.7056 MULTIPOLYGON (((39.3143 -6....

In this case a left_join works because the number of records is identical and each row in each dataset has a unique district name.

4.2.2 More subsetting

We have already done a bit of this a couple of sections back, but let’s use this opportunity to do a bit more subsetting, including a previously unseen approach:

# Chunk 30
# #1
kanyi_dists <- districts %>% filter(grepl("^Ka|^Nyi", distName))
# 
# #2
dists_gt2m <- districts %>% filter(as.numeric(st_area(.) / 10^6) > 20000)

par(mar = c(0, 0, 0, 0))
plot(st_geometry(districts), col = "grey")
plot(st_geometry(kanyi_dists), col = rgb(0, 0, 1, alpha = 0.5), add = TRUE)
plot(st_geometry(dists_gt2m), col = rgb(1, 0, 0, alpha = 0.5), add = TRUE)

That’s two examples of subsetting that that are accomplished with filter. #1 is the new one, as it uses grepl to select districts with names beginning with either “Ka” or “Nyi”. grepl is one of several functions that make use of regular expressions to identify and do things (select, substitute, split, etc) to string variables. You can read up more on the subject by starting with ?grepl and ?regex. Regular expressions are quite powerful, and although they can be somewhat arcane, they are well worth learning (although we don’t do much with them here). #2 creates dists_gt2m) by calculating area within filter on the fly, and using that to select districts >20000 km\(^2\). The resulting subsets are plotted onto the full district map of Tanzania, using another function appearing for the first time in geospaar, rgb, which we use to define the colors for filling the polygons. rgb makes red-green-blue color combinations, and the alpha argument allows us to define how transparent the colors are. In this case, our two subsets selected some of the same polygons (two of the districts with names beginning with “Ka” or “Ny” are also larger than 20,000 km\(^2\)), thus the partial transparency is needed to show where the overlaps occur.

4.2.3 Split-apply-combine

As with plain old data.frames, we might want to do split-apply-combine operations on spatial data. Let’s demonstrate this selecting the districts in districts_alb2 that fall within certain area ranges, convert the selected districts to centroid points, and then plot them.

# Chunk 30
# #1
tertiles <- function(x) quantile(x, probs = c(0, 0.333, 0.667, 1))
dist_tertiles <- districts %>% 
  mutate(area = as.numeric(st_area(.)) / 10^6) %>%
  mutate(acls = cut(area, breaks = tertiles(area), include.lowest = TRUE)) %>% 
  group_by(acls) %>% 
  summarize(mean_area = mean(area))  
#> although coordinates are longitude/latitude, st_union assumes that they are planar
dist_tertiles
#> Simple feature collection with 3 features and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 29.58953 ymin: -11.76235 xmax: 40.44473 ymax: -0.983143
#> Geodetic CRS:  WGS 84
#> # A tibble: 3 × 3
#>   acls                mean_area                                                             geometry
#>   <fct>                   <dbl>                                                   <MULTIPOLYGON [°]>
#> 1 [15.5,1.51e+03]          651. (((32.75855 -9.249037, 32.76249 -9.252446, 32.76649 -9.254365, 32.7…
#> 2 (1.51e+03,5.19e+03]     3146. (((31.4864 -7.460071, 31.4871 -7.461485, 31.48803 -7.465252, 31.487…
#> 3 (5.19e+03,3.43e+04]    11765. (((35.86454 -11.41009, 35.86409 -11.41199, 35.8631 -11.41347, 35.86…
#
# #2
par(mar = rep(0, 4))
plot(st_geometry(dist_tertiles), col = c("red", "purple", "blue"))
legend(x = "bottomleft", pch = 15, col = c("red", "purple", "blue"), 
       bty = "n", legend = c("Bottom third", "Middle third", "Largest third"))

The above is a dplyr-based split-apply-combine, which has a fair bit going on, so let’s explain:

  • We first create a new function called tertiles, which will divide any vector x it is given into thirds.
  • In the pipeline, the first line does the usual creation of an area variable. The second line use the cut function (see ?cut) to classify the area variable into thirds (i.e. districts whose areas are below the 33rd percentile area, those between the 33rd and 66th percentile areas, and those above the 66th percentile), using our tertile function, resulting in a new acls variable.
  • The third line in the pipeline groups the data by acls, and then calculates the mean area of each group. This summarize is spatial, in that it aggregates (dissolves) the polygons that are in group into a single large polygon for each group, while creating an output value describing the average area of the districts that were unioned to make each group. This is achieved by a generic summarize method defined for sf (see ?summarise.sf).
  • The plot shows the result of the union (plus a legend that we add)

We can do the same thing for standard deviation, but using a quintile grouping:

# Chunk 31
quintiles <- function(x) quantile(x, probs = seq(0, 1, 0.2))
dist_quintiles <- districts %>% 
  mutate(area = as.numeric(st_area(.)) / 10^6) %>%
  mutate(acls = cut(area, breaks = quintiles(area), include.lowest = TRUE)) %>% 
  group_by(acls) %>% 
  summarize(sd_area = sd(area))
#> although coordinates are longitude/latitude, st_union assumes that they are planar
dist_quintiles
#> Simple feature collection with 5 features and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 29.58953 ymin: -11.76235 xmax: 40.44473 ymax: -0.983143
#> Geodetic CRS:  WGS 84
#> # A tibble: 5 × 3
#>   acls               sd_area                                                                geometry
#>   <fct>                <dbl>                                                      <MULTIPOLYGON [°]>
#> 1 [15.5,718]            192. (((32.75855 -9.249037, 32.76249 -9.252446, 32.76649 -9.254365, 32.7674…
#> 2 (718,2.11e+03]        408. (((33.44827 -9.187084, 33.44973 -9.191858, 33.45134 -9.195909, 33.4525…
#> 3 (2.11e+03,4e+03]      546. (((31.01872 -2.843939, 31.01967 -2.84478, 31.01979 -2.846578, 31.02241…
#> 4 (4e+03,7.9e+03]      1097. (((32.44603 -8.245552, 32.44556 -8.246121, 32.44575 -8.24695, 32.44642…
#> 5 (7.9e+03,3.43e+04]   6703. (((35.86124 -11.41664, 35.85951 -11.41725, 35.85778 -11.41777, 35.8556…

cols <- heat.colors(5)
par(mar = rep(0, 4))
plot(st_geometry(dist_quintiles), col = cols)
legend(x = "bottomleft", pch = 15, col = cols, bty = "n", 
       legend = paste0(1:5, c("st", "nd", "rd", "th", "th"), " quantile"))

Our plot in this case uses the heat.colors() function for filling the polygons.

That was a quick introduction to SAC for sf. There are others we could do (you can visit the tidyverse examples page for sf to see some others), but we will move on to the learn about other spatial operations now.

4.3 Practice

4.3.1 Questions

  1. How different are areas calculated using projected and unprojected (geographic) coordinate systems when using st_area? How come st_area “knows” how to calculate an area in m\(^2\) when the CRS is unprojected (the answer isn’t in the material above, but can be found if you read through ?st_area)?

  2. You will note that we haven’t used ggplot() + geom_sf() in this section. Why is that?

  3. How can we create a grouping variable that we can use to split an sf object according to different levels of spatial properties (e.g. area)?

4.3.2 Code

  1. Using a seed of 1, randomly select (sample_n) 10 districts from districts and calculate their average area in hectares.

  2. Using a seed of 1, randomly select (sample_n) 100 road segments from roads and calculate their average length in kilometers.

  3. Plot districts with a “lightgrey” background, and then randomly select (seed of 1) 200 observations from species of Syncerus caffer, and plot them as red circles over districts.

  4. From road select the roads that have lengths between 50 and 100 km and plot those in red over districts. Note that districts and roads are in different projections, so transform district to have the same CRS as roads first.

  5. Hack the code in Chunk 31 to create a map of districts merged by their deciles of area. Do the summary by total area. To do this, the quantile function will have to be altered (pass seq(0, 1, 0.1) to “probs”), and make sure your color vector has enough values.

5 Spatial Operations

In this section we will focus on some of the more common polygon-based spatial operations. To enable this, we are first going to create some additional spatial data.

# Chunk 32
# #1
coords <- cbind("x" = c(34.2, 34.3, 35.6, 35.6, 35.6, 34.2), 
                "y" = c(-4.8, -3.8, -3.9, -4.1, -4.9, -4.8))
#
# #2
pol <- st_polygon(x = list(coords)) %>% 
  st_sfc %>% 
  st_sf(ID = 1, crs = 4326)
pol
#> Simple feature collection with 1 feature and 1 field
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 34.2 ymin: -4.9 xmax: 35.6 ymax: -3.8
#> Geodetic CRS:  WGS 84
#>   ID                              .
#> 1  1 POLYGON ((34.2 -4.8, 34.3 -...

par(mar = rep(0, 4))
plot(st_geometry(districts), col = "grey")
plot(pol, col = "purple", add = TRUE)

The above takes coords, a matrix of x and y coordinates (#1), and repeats the approach we used in Chunk 19 (Section 3.6) to create an st_polygon (#2), but then continues on to convert that to a simple feature geometry list column (st_sfc), then to create an sf, a “data.frame-like object…with a simple feature list column” (?st_sf), including the assignment of a CRS. That’s how you build up a full simple feature with attributes.

5.1 Overlays/spatial queries

The first thing we will do with our made-up polygon is figure out which districts it intersects. What we are going to do in this case is the ArcGIS equivalent of select by spatial location (if I remember my ESRI terminology correctly–it’s been a while since I looked at it)

# Chunk 32
# #1
st_intersects(x = pol, y = districts)
#> although coordinates are longitude/latitude, st_intersects assumes that they are planar
#> Sparse geometry binary predicate list of length 1, where the predicate was `intersects'
#>  1: 3, 13, 15, 68, 70, 72, 142, 143, 144, 146, ...
#
# #2  
dists_int <- districts %>% slice(st_intersects(x = pol, y = districts)[[1]])
#> although coordinates are longitude/latitude, st_intersects assumes that they are planar
dists_int
#> Simple feature collection with 12 features and 1 field
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 33.91347 ymin: -5.683134 xmax: 36.34537 ymax: -2.928766
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>    distName                       geometry
#> 1    Karatu MULTIPOLYGON (((35.89154 -3...
#> 2    Chemba MULTIPOLYGON (((35.35449 -4...
#> 3    Kondoa MULTIPOLYGON (((35.65407 -4...
#> 4    Babati MULTIPOLYGON (((35.72509 -3...
#> 5    Hanang MULTIPOLYGON (((34.92725 -4...
#> 6     Mbulu MULTIPOLYGON (((35.36185 -3...
#> 7     Meatu MULTIPOLYGON (((34.09986 -3...
#> 8    Ikungi MULTIPOLYGON (((34.59843 -4...
#> 9    Iramba MULTIPOLYGON (((34.50044 -3...
#> 10  Mkalama MULTIPOLYGON (((34.92725 -4...
# 
# #3
par(mar = rep(0, 4))
plot(st_geometry(districts), col = "grey")
plot(st_geometry(dists_int), col = "blue", add = TRUE)
plot(st_geometry(pol), border = "purple", add = TRUE)

In #1 we use st_intersects to identify the indices of the districts that intersect pol. So that doesn’t give us back the actual intersecting district polygons. To get those, we have to do a bit more. The output from st_intersects is a list object, which contains the index of the polygon in x (the “1:” in the output) followed by the polygons in the object passed to y that it intersects (15, 22, 26, 53, 54). We can simplify that results to a vector by specifying the list index we want (1): st_intersects(x = pol, y = districts)[[1]]. We plug that into slice in #2 to pull out the polygons from districts into a new object dists_int, and then #3 illustrates the intersected districts in blue.

5.2 Intersecting

We have just seen how st_intersects can be used to identify and extract intersecting polygons. Now let’s look at what st_intersection does.

# Chunk 33
pol_int_dists <- st_intersection(x = pol, y = districts)
#> although coordinates are longitude/latitude, st_intersection assumes that they are planar
#> Warning: attribute variables are assumed to be spatially constant throughout all geometries

par(mar = rep(0, 4))
plot(st_geometry(districts), col = "grey")
plot(st_geometry(pol_int_dists), col = rainbow(n = nrow(pol_int_dists)), 
     add = TRUE)

Seems pretty clear what st_intersection does, and how it differs from st_intersects: it crops the intersected districts to the boundaries of pol, and returns the intersected district polygons. It also spits out a warning about the attribute variable being spatially constant, as well as the usual warning about the data not being in planar coordinates. We also introduced another color function (rainbow) to distinguish the different intersected districts.

5.3 Aggregating/Dissolving

Now that we have intersected boundaries, let’s dissolve them. We’ve actually already seen this (in section 4.2.3), but let’s have a second look:

# Chunk 34
# #1
pol_int_dists %>% summarize
#> although coordinates are longitude/latitude, st_union assumes that they are planar
#> Simple feature collection with 1 feature and 0 fields
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 34.2 ymin: -4.9 xmax: 35.6 ymax: -3.8
#> Geodetic CRS:  WGS 84
#>                                .
#> 1 POLYGON ((34.89974 -4.84998...
# 
# #2
pol_int_muarea <- pol_int_dists %>% 
  mutate(area = as.numeric(units::set_units(st_area(.), "km^2"))) %>% 
  summarize(area = mean(area))
#> although coordinates are longitude/latitude, st_union assumes that they are planar
pol_int_muarea
#> Simple feature collection with 1 feature and 1 field
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 34.2 ymin: -4.9 xmax: 35.6 ymax: -3.8
#> Geodetic CRS:  WGS 84
#>       area                              .
#> 1 1385.954 POLYGON ((34.89974 -4.84998...

par(mfrow = c(1, 2))
pol_int_dists %>% 
  st_geometry() %>% 
  plot(col = rainbow(n = nrow(pol_int_dists)), main = "Intersections")
pol_int_muarea %>% 
  st_geometry() %>% 
  plot(col = "blue", main = "Dissolved intersections")

sf’s generic for summarize does the job for us. In #1, we run summarize without specifying a variable to summarize, and it returns a field-less (attribute-less) single polygon (which is identical to pol, but not mapped). In #2, we add a variable that is meaningful to summarize, which is the area of the intersected districts. Our summary is the average area of those intersected districts.

For a GIS person, it might be confusing to run a function called summarize when you specifically want to do a dissolve operation. Shouldn’t there be a function named that, or something close to it? Well, there is one: st_union, which is actually called by summarise.sf (which is called by the generic summarize or summarise in the presence of an sf object) by default. We can see its behavior here:

# Chunk 35
pol_int_dists %>% st_union()

Same result, although in this case it produces a geometry set, and, unlike summarize, st_union does not retain information about the data values associated with the polygons.

5.4 Casting

Sometimes a spatial operation will result in mixed geometry types within a single sf object, or a more complex geometry type than you actually need. You might need to make all the geometries in the object uniform (to prevent problems with other functions that don’t want mixed geometry types), or simplify. Doing this is the job of st_cast. Let’s look again at pol_int_dists

# Chunk 36
pol_int_dists
#> Simple feature collection with 12 features and 2 fields
#> Geometry type: GEOMETRY
#> Dimension:     XY
#> Bounding box:  xmin: 34.2 ymin: -4.9 xmax: 35.6 ymax: -3.8
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>     ID distName                              .
#> 1    1   Karatu POLYGON ((35.04069 -3.85697...
#> 1.1  1   Chemba POLYGON ((35.33888 -4.88134...
#> 1.2  1   Kondoa POLYGON ((35.6 -4.9, 35.433...
#> 1.3  1   Babati POLYGON ((35.6 -4.1, 35.6 -...
#> 1.4  1   Hanang MULTIPOLYGON (((35.6 -4.696...
#> 1.5  1    Mbulu POLYGON ((35.6 -3.9, 35.6 -...
#> 1.6  1    Meatu POLYGON ((34.3 -3.8, 34.71 ...
#> 1.7  1   Ikungi MULTIPOLYGON (((34.20477 -4...
#> 1.8  1   Iramba POLYGON ((34.27165 -4.08353...
#> 1.9  1  Mkalama POLYGON ((34.75013 -3.83462...

rbcols <- rainbow(n = nrow(pol_int_dists))
par(mar = rep(0, 4))
pol_int_dists %>% 
  st_geometry %>% 
  plot(col = rbcols)

Notice how the object contains three polygons and one multipolygon, which is the blue district (Mufumbwe), which needs to be MULTIPOLYGON because the light green district (Kaoma) drives a wedge between it, making it two pieces. Let’s run st_cast:

# Chunk 37 
pol_int_dists %>% st_cast
#> Simple feature collection with 12 features and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 34.2 ymin: -4.9 xmax: 35.6 ymax: -3.8
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>     ID distName                              .
#> 1    1   Karatu MULTIPOLYGON (((35.04069 -3...
#> 1.1  1   Chemba MULTIPOLYGON (((35.33888 -4...
#> 1.2  1   Kondoa MULTIPOLYGON (((35.6 -4.9, ...
#> 1.3  1   Babati MULTIPOLYGON (((35.6 -4.1, ...
#> 1.4  1   Hanang MULTIPOLYGON (((35.6 -4.696...
#> 1.5  1    Mbulu MULTIPOLYGON (((35.6 -3.9, ...
#> 1.6  1    Meatu MULTIPOLYGON (((34.3 -3.8, ...
#> 1.7  1   Ikungi MULTIPOLYGON (((34.20477 -4...
#> 1.8  1   Iramba MULTIPOLYGON (((34.27165 -4...
#> 1.9  1  Mkalama MULTIPOLYGON (((34.75013 -3...

That converts all POLYGONS to MULTIPOLYGONs, the simplest common geometry type that can be shared by all 4 geometry features. One can also specify the type of geometry you want it to be:

# Chunk 38
pol_int_dists %>% st_cast("POLYGON")
#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only
#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only
#> Simple feature collection with 12 features and 2 fields
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 34.2 ymin: -4.9 xmax: 35.6 ymax: -3.8
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>     ID distName                              .
#> 1    1   Karatu POLYGON ((35.04069 -3.85697...
#> 1.1  1   Chemba POLYGON ((35.33888 -4.88134...
#> 1.2  1   Kondoa POLYGON ((35.6 -4.9, 35.433...
#> 1.3  1   Babati POLYGON ((35.6 -4.1, 35.6 -...
#> 1.4  1   Hanang POLYGON ((35.6 -4.696429, 3...
#> 1.5  1    Mbulu POLYGON ((35.6 -3.9, 35.6 -...
#> 1.6  1    Meatu POLYGON ((34.3 -3.8, 34.71 ...
#> 1.7  1   Ikungi POLYGON ((34.20477 -4.75227...
#> 1.8  1   Iramba POLYGON ((34.27165 -4.08353...
#> 1.9  1  Mkalama POLYGON ((34.75013 -3.83462...

It throws a warning, because Mufumbwe can’t be reduced to a POLYGON without jettisoning one of two pieces, so it let’s us know that it had to do that. We can see the result:

# Chunk 39
par(mfrow = c(1, 3), mar = c(0, 0, 1, 0))
pol_int_dists %>% 
  st_geometry %>% 
  plot(col = rbcols, main = "Uncast")
pol_int_dists %>% 
  st_cast() %>% 
  st_geometry %>% 
  plot(col = rbcols, main = "Cast to MULTIPOLYGON")
pol_int_dists %>% 
  st_cast("POLYGON") %>% 
  st_geometry %>% 
  plot(col = rbcols, main = "Cast to POLYGON")
#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only
#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

No difference between the original pol_int_dists and the straight cast of it, but when we try cast to POLYGON, the bottom part of the blue district disappears.

Plotting tip: note how we add titles to the plots and also tweak the margins to add space to the top margin so that those titles can be displayed.

We can also cast the polygons to lines, and even to points:

# Chunk 40
par(mfrow = c(1, 2), mar = c(0, 0, 1, 0))
pol_int_dists %>% 
  st_cast("MULTILINESTRING") %>% 
  st_geometry %>% 
  plot(col = rbcols, main = "Cast to MULTILINESTRING")
pol_int_dists %>% 
  st_cast("MULTIPOINT") %>% 
  st_geometry %>% 
  plot(col = rbcols, main = "Cast to MULTIPOINT", pch = 21, cex = 0.5)

#> null device 
#>           1

So that’s st_cast. Not really a spatial operation per se, but can be important for transitioning between spatial operations, as indicated by this passage from sf’s 3rd vignette:

When functions return objects with mixed geometry type (GEOMETRY), downstream functions such as st_write may have difficulty handling them. For some of these cases, st_cast may help modifying their type. For sets of geometry objects (sfc) and simple feature sets (sf),st_cast` can be used by specifying the target type, or without specifying it.

To learn more st_cast and its behavior, please read that vignette.

5.5 Differencing

Now we are going to move on to look at another bread and butter operation, differencing (or erasing).

# Chunk 41
d1 <- st_difference(x = districts, y = pol)
#> although coordinates are longitude/latitude, st_difference assumes that they are planar
#> Warning: attribute variables are assumed to be spatially constant throughout all geometries
d2 <- st_difference(x = pol, y = districts)
#> although coordinates are longitude/latitude, st_difference assumes that they are planar
#> Warning: attribute variables are assumed to be spatially constant throughout all geometries

par(mfrow = c(1, 2), mar = c(0, 0, 1, 0))
d1 %>% 
  st_geometry %>% 
  plot(col = "grey")
d2 %>% 
  st_geometry %>% 
  plot(col = "grey")

d1 erases pol, leaving a hole, whereas d2, by reversing the order of objects into the “x” and “y” arguments, does the same thing as an intersection, by erasing all the districts surrounding pol.

5.6 Unioning

We have already seen st_union in action to dissolve internal polygon boundaries. Now let’s look at unioning two objects.

# Chunk 42
# #1
uni1 <- st_union(x = pol, y = districts)
#> although coordinates are longitude/latitude, st_union assumes that they are planar
#> Warning: attribute variables are assumed to be spatially constant throughout all geometries
uni1
#> Simple feature collection with 170 features and 2 fields
#> Geometry type: GEOMETRY
#> Dimension:     XY
#> Bounding box:  xmin: 29.58953 ymin: -11.76235 xmax: 40.44473 ymax: -0.983143
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>     ID     distName                              .
#> 1    1       Arusha MULTIPOLYGON (((34.2 -4.8, ...
#> 1.1  1 Arusha Urban MULTIPOLYGON (((34.2 -4.8, ...
#> 1.2  1       Karatu POLYGON ((34.3 -3.8, 34.750...
#> 1.3  1      Longido MULTIPOLYGON (((34.2 -4.8, ...
#> 1.4  1         Meru MULTIPOLYGON (((34.2 -4.8, ...
#> 1.5  1      Monduli MULTIPOLYGON (((34.2 -4.8, ...
#> 1.6  1   Ngorongoro MULTIPOLYGON (((34.2 -4.8, ...
#> 1.7  1        Ilala MULTIPOLYGON (((34.2 -4.8, ...
#> 1.8  1    Kinondoni MULTIPOLYGON (((34.2 -4.8, ...
#> 1.9  1       Temeke MULTIPOLYGON (((34.2 -4.8, ...
#
# #2
uni2 <- st_union(x = districts, y = pol)
#> although coordinates are longitude/latitude, st_union assumes that they are planar
#> Warning: attribute variables are assumed to be spatially constant throughout all geometries
uni2
#> Simple feature collection with 170 features and 2 fields
#> Geometry type: GEOMETRY
#> Dimension:     XY
#> Bounding box:  xmin: 29.58953 ymin: -11.76235 xmax: 40.44473 ymax: -0.983143
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>        distName ID                       geometry
#> 1        Arusha  1 MULTIPOLYGON (((36.82231 -3...
#> 2  Arusha Urban  1 MULTIPOLYGON (((36.63247 -3...
#> 3        Karatu  1 POLYGON ((35.88634 -3.26032...
#> 4       Longido  1 MULTIPOLYGON (((36.35742 -3...
#> 5          Meru  1 MULTIPOLYGON (((36.89951 -3...
#> 6       Monduli  1 MULTIPOLYGON (((36.54092 -3...
#> 7    Ngorongoro  1 MULTIPOLYGON (((35.32059 -1...
#> 8         Ilala  1 MULTIPOLYGON (((39.21843 -6...
#> 9     Kinondoni  1 MULTIPOLYGON (((39.27007 -6...
#> 10       Temeke  1 MULTIPOLYGON (((39.3143 -6....
#
# #3
par(mfrow = c(1, 2), mar = c(0, 0, 1, 0))
plot(uni1[2], reset = FALSE, main = "Union of pol with districts")
plot(uni2[2], reset = FALSE, main = "Union of districts with pol")

Two different order to unioning pol and districts (#1 and #2). The ordering simply results in the ordering of the attribute fields in the resulting unioned datasets (the fields from the object in the “x” argument appear first) after the overlapping polygons are merged. You can see the difference in the ordering in the resulting plots (#3), in which for the first time we use index notation to specify the field of each object that we want the plot to render. In this case, we select the second field, which is different in each unioned object, hence the different fill patterns. This index-based referencing is a good way to avoid having to use st_geometry, while letting sf::plot do the work of selecting a color scheme.

5.7 Buffering

This is a fairly common GIS operation that sf handles nicely.

# Chunk 43
# 
# #1
species_alb <- species %>% st_transform(st_crs(roads))
long_roads <- roads %>% filter(as.numeric(st_length(.)) > 50000)
districts_alb <- districts %>% st_transform(st_crs(roads))

# #2
buffs <- lapply(c(10000, 20000, 30000), function(x) { 
  set.seed(123)
  #2a
  buf_species <- species_alb %>% filter(species == "Syncerus caffer") %>%
    st_sample(size = 5) %>% st_buffer(dist = x) # here is the buffer
  #2b
  buf_roads <- long_roads %>% st_buffer(dist = x)  # here is the buffer
  #2c
  list("species" = buf_species, "roads" = buf_roads)
})
#
# using purrr::map
# buffs <- c(10000, 20000, 30000) %>% map(function(x) { 
#   set.seed(123)
#   #2a
#   buf_species <- species_alb %>% filter(species == "Syncerus caffer") %>%
#     st_sample(size = 5) %>% st_buffer(dist = x)
#   #2b
#   buf_roads <- long_roads %>% st_buffer(dist = x)
#   list("species" = buf_species, "roads" = buf_roads)
# })
#
# #3
par(mar = c(0, 0, 0, 0))
cols <- c("cyan", "blue2", "orange")  # 2
cols2 <- c("purple", "green4", "antiquewhite")
districts_alb %>% st_union %>% plot(col = "grey")
for(i in length(buffs):1) {  # 4
  plot(buffs[[i]]$roads, add = TRUE, col = cols2[i])  # 6
  plot(buffs[[i]]$species, add = TRUE, col = cols[i])  # 5
}

A fair bit of code up there, so let’s walk through it. In #1 we convert species and districts to Albers projections, as buffers should be specified in meters, not decimal degrees, which one would have to do if buffering on a GCS project. We also extract from roads the segments that are longer than 50 km. In #2, we then insert our buffering step into an lapply, as we apply three different buffer widths (10, 20, and 30 km), specified in meters (the unit of the Albers projection) into st_buffer. We first buffer around 5 randomly selected species from the second season, using st_sample instead of sample_n (which we did previously), because it allows merging of buffers where they touch, whereas a sample drawn with sample_n would apply a separate buffer to each point. We then apply st_buffer to the selected roads. The commented out chunk of code shows how we can use purrr::map to do the same work as lapply, using piping to pass the iteration vector into the map function (you can also write it that way for lapply).

In #3 we draw the plots, setting up an outline of Tanzania by dissolving district boundaries, then plotting the buffered shapes in a for loop (note the reversed index ordering, to plot from largest to smallest buffers).

5.8 Sampling

We have seen a little bit of sampling already, but the previous examples just sample from the features in an existing object. Sometimes you might be interested in confining the sample to a particular feature or features, for example:

# Chunk 44
par(mar = rep(0, 4))
districts %>% 
  st_union %>% 
  plot(col = "grey")
set.seed(1)
districts %>% 
  st_union %>% 
  st_sample(size = 100, exact = TRUE) %>% 
  plot(pch = 20, add = TRUE)

That uses st_sample to pull a random sample of 100 points from within the merged district. We set the exact = TRUE argument to force the returned sample to be the exact number we specify (otherwise it can sometimes be a bit less).

Let’s look at this in a bit more depth:

# Chunk 45
# #1
seed <- 40
set.seed(seed)
dists_7 <- districts %>% sample_n(7)
#
# #2
set.seed(seed)
samp14 <- dists_7 %>% st_sample(size = 14, exact = TRUE)
# 
# #3
set.seed(seed)
samp7_2 <- dists_7 %>% st_sample(size = rep(2, nrow(dists_7)), exact = TRUE)

par(mar = rep(0, 4))
districts[2] %>% plot(col = "grey", reset = FALSE)
dists_7 %>% plot(col = "khaki", add = TRUE)
samp14 %>% plot(col = "red", pch = 20, add = TRUE)
samp7_2 %>% plot(col = "blue", pch = "+", add = TRUE)

In #1 we start by randomly selecting 7 districts from districts (using sample_n). We then use st_sample to randomly select 14 points from within those 7 randomly selected districts (#2). In #3, we specify that we want to randomly select 2 points within each of the 7 districts–to do that, we create a vector containing 2 repeated as many times as there are rows in dists_7, so the sample size is specified for each of the 7 sampled districts. The output plots show the differences between the two sampling strategies. The approach used in #2 distributes an unequal number of points across the 7 districts, and misses one district entirely. The approach in #3 places two points in each of the 7 districts. It is not hard to see that one could go from here to create a stratified random sample in which the size was made proportional to each district’s area.

So that’s it for this unit, which has given a brief and by no means exhaustive introduction to do some common vector operations and ways in which vector data are manipulated.

5.9 Practice

5.9.1 Questions

  1. Why it is the purpose of st_cast, and why might we need to cast features?

  2. What spatial operation occurs when summarize is applied to an sf object?

  3. In what way does the order of unioning affect the result of st_union on two spatial objects?

5.9.2 Code

  1. Using the code in Chunk 32 as your template, create a rectangular sf polygon called pol2 using the x coordinates 35 and 35.5 (minimum and maximum) and y coordinates -4 and -4.5. Coordinates vectors should be ordered like this for x: xmin, xmax, xmax, xmin, xmin. And for y: ymin, ymin, ymax, ymax, ymin. In both cases, think of min and max in absolute terms (i.e. drop the negative sign). Plot the result on top of districts in blue. It should look like this:

  1. Use your new pol2 to do an st_intersection with districts. Call it pol2_int_dists. Plot it using rainbow colors on top of grey districts.

  2. Calculate the difference between 1) the sum of areas calculated from pol2_int_dists and 2) the area of pol2.

  3. Use st_difference to create a pol2 sized hole in districts. Do the difference on the fly (i.e. don’t save it), and pipe the result to plot(col = "grey")

  4. From Chunk 43, recreate species_alb, and use the code in #2a to recreate buffers of 30 km width around the 5 randomly selected species points, and plot those. Recreate the buffers, but use sample_n to do the random draw. Plot those and compare the difference.

  5. Select from roads the roads that are greater than 40 km long, and save the result in roads_gt40. Place a 25 km buffer around those roads. Plot the result over the Albers-projected districts (in grey) in green.

  6. Select a random sample of exactly 20 points within each of the 25 km buffers around roads_gt40. Plot the resulting points as red circles over the tan buffered roads and grey districts.

6 Unit assignment

6.1 Set-up

Make sure you are working in the main branch of your project (you should not be working on the a3 branch). Create a new vignette named “unit2_module1.Rmd”. You will use this to document the tasks you undertake for this assignment.

There are no package functions to create for this assignment. All work is to be done in the vignette.

6.2 Vignette tasks

  1. Read in the species.csv, districts.geojson, and roads.geojson datasets. Convert it to an sf object. Reproject the species and districts data to Albers projection (using the CRS from roads), naming each species_alb and districts_alb. Ideally you will do all the necessary steps to create species_alb and districts_alb in one pipeline (worth an extra 0.5 points).

  2. Create a plot using sf::plot that shows all three datasets on one map, with districts_alb in grey, with roads in red over that, and species_alb as a blue cross over that. Use the relevant chunk arguments to center the figure in the vignette html, and to have a height of 4 inches and a width of 6 inches (Hints: fig.align, fig.height, fig.width). The figure should have 0 margins all around (Hint: par).

  3. Make the same plot above using ggplot and geom_sf. When adding species_alb to the plot, use pch = "+" and size = 3 as arguments to geom_sf. Add the function theme_bw() to the ggplot construction chain, to get rid of the grey background. Make the “fill” (rather than “color”) of districts_alb grey. Center the figure using chunk options and make the figure width 5 inches and height 6 inches.

  4. Select from districts_alb the district representing the 50th percentile area, i.e. the median area, and save that district into a new object median_dist. Plot it in “khaki” on top of grey districts_alb. Give the plot a title “The median area district”. Same plot dimensions in the vignette html as for Task 2, but a leave a space of 1 at the top in the plot mar.

  5. Convert the median_dist to its centroid point. Call it median_distp. filter the species_alb data for species belonging to family “Felidae”, and then find the 20 closest Felidae species to median_distp. To do that, create the new object closest_20species by using mutate with st_distance to create a new variable dist (convert it to numeric), and then arrange by variable dist and slice out the top 20 observations. Use ggplot2 to plot districts_alb in grey, median_dist over that in khaki, median_distp as a solid purple circle, species_alb in blue, and closest_20species in red. Zero margins (Hint: use theme(plot.margin = margin(t = 0, r = 0, b = 0, l = 0, unit = "cm")) in the ggplot2 pipeline) and width of 6 inches and height of 4 inches.

  6. Create a rectangular sf polygon called mypol using the x coordinates 32 and 33 (minimum and maximum) and y coordinates -5 and -6. Assign it crs = 4326 and transform it to Albers. Select from districts_alb the districts that intersect mypol, and plot in “grey40” over districts_alb in grey, and plot over that mypol without any fill but just a yellow border. Zero margins and width of 6 inches and height of 4 inches. Calculate the area in ha of mypol and report it in your vignette below this plot.

  7. Create mypol_dist_int from the intersection of mypol and districts_alb, recasting the intersected districts to multipolygons, and adding an area variable onto it that reports areas of intersections in hectares. Do all that in one pipeline. Use sf::plot to plot mypol_dist_int in rainbow colors over districts_alb in grey. Zero margins and width of 6 inches and height of 4 inches. Report the mean and median of intersections in the vignette below the plot.

  8. Find the shortest and longest road segments in Tanzania, and place the selected roads into a new object (roads_extreme). To do this, you will need to arrange roads by length and then slice to get the first and last observations (of course you need to first calculate length). Do that as one pipeline. Then calculate a 50 km buffer around those two roads (roads_extreme_buff). Use sf::plot to plot roads_extreme_buff in blue over districts_alb in grey, and add roads_extreme on top of that as red lines (use lwd = 3 in the plot). Zero margins and width of 6 inches and height of 4 inches.

  9. Select a random sample of 10 points in the smallest object, and one of 50 in the largest object in roads_extreme_buff. Use a single call to st_sample to do that (Hint: check the help of function st_sample). Use a seed of 2. Use sf::plot to plot those points as yellow solid points over the same map created in Task 8 above. Use the same dimensions.

  10. Your final task is to intersect roads with the buffer of the longest road in roads_extreme_buff (name the result as roads_int). Plot the buffer of the longest road in blue over the districts in grey, and then roads_int as red lines. Use the same dimensions as the previous two plots. Report the total distance of intersected roads in km in the vignette below the plot.

6.3 Assignment output

Refer to Unit 1 Module 4’s assignment output section for submission instructions. The only differences are as follows:

  1. You should increment your package version number by 1 on the lowest number (e.g. from 0.1.2 to 0.1.3) in the DESCRIPTION;

  2. When asked to report the result of calculations in the vignette text (e.g. in Task 10), you can use another RMarkdown trick, which is to let the calculation be done by code inserted between a single pair of backticks, with the first backtick followed immediately by r and a space. See here for more explanation of that.

  3. Revise the setting chunk in the beginning of your vignette to:

    knitr::opts_chunk$set(
      collapse = TRUE,
      comment = "#>",
      warning = FALSE,
      message = FALSE
    )

    to avoid unnecessary warning and message.

  4. Update vignette title and this line in the YAML header to %\VignetteIndexEntry{Unit 2 Module 1 assignment 4}. When you browse the vignettes of your package, you should see a page in your browser like this:

  1. Your submission should be on a new side branch “a4”.

Back to home